Practical Session 3: Managing date formats and plotting

Expected learning outcomes

By the end of the session, participants should be able to:

  • Manage surveillance datasets with different date formats

  • Create specific time series objects using tsibble

  • Plot surveillance data against time

You have been provided with one MS Excel file (tsa_practice.xlsx) containing 2 sheets, one for each of two different diseases (dis1, dis2); and one csv file (tsa_pumala.csv) about Puumala virus infections in Finland.

Task 3.1

Assess visually the reported number of cases of dis1 by ISO (epidemiological) week or calendar month for all the data provided (“dis2” is optional).

What diseases do you think dis1 is, judging from the cases’ distribution in time?

Help for Task 3.1

Required source code.

Show the code
# Install pacman if not installed already, and activate it
if (!require("pacman")) install.packages("pacman")

# Install, update and activate libraries
pacman::p_load(
  here, 
  rio, 
  skimr,
  tsibble,
  ISOweek,
  tidyverse
)

Import your data to R from these Excel files and examine them using the well-known str, summary and skim functions

Show the code
dis1 <- import(here("data", "tsa_practice.xlsx"), which = "dis1")
Show the code
str(dis1)
'data.frame':   431 obs. of  3 variables:
 $ year : num  1981 1981 1981 1981 1981 ...
 $ week : num  1 2 3 4 5 6 7 8 9 10 ...
 $ cases: num  1 3 NA 4 3 2 3 4 1 NA ...
Show the code
summary(dis1)
      year           week           cases       
 Min.   :1981   Min.   : 1.00   Min.   :  1.00  
 1st Qu.:1983   1st Qu.:12.50   1st Qu.:  9.00  
 Median :1985   Median :26.00   Median : 28.00  
 Mean   :1985   Mean   :25.86   Mean   : 65.16  
 3rd Qu.:1987   3rd Qu.:39.00   3rd Qu.: 83.50  
 Max.   :1989   Max.   :52.00   Max.   :554.00  
                                NA's   :12      
Show the code
skim(dis1) %>% as_tibble()
# A tibble: 3 × 12
  skim_type skim_variable n_missing complete_rate numeric.mean numeric.sd
  <chr>     <chr>             <int>         <dbl>        <dbl>      <dbl>
1 numeric   year                  0         1           1985.        2.40
2 numeric   week                  0         1             25.9      15.2 
3 numeric   cases                12         0.972         65.2      96.3 
# ℹ 6 more variables: numeric.p0 <dbl>, numeric.p25 <dbl>, numeric.p50 <dbl>,
#   numeric.p75 <dbl>, numeric.p100 <dbl>, numeric.hist <chr>

Since we will work with time series data, we could use some time to explore the most relevant variable: date (in this case, the year variable). dis1 is a dataset containing weekly counts of a certain disease between 1981 and 1989. A typical year has 52 weeks, thus we can count the number of times each year appear in the data using table, tabyl or count function, your go-to tools for categorical variables

Show the code
table(dis1$year)

1981 1982 1983 1984 1985 1986 1987 1988 1989 
  52   52   52   52   52   52   52   52   15 
Show the code
janitor::tabyl(dis1$year)
 dis1$year  n    percent
      1981 52 0.12064965
      1982 52 0.12064965
      1983 52 0.12064965
      1984 52 0.12064965
      1985 52 0.12064965
      1986 52 0.12064965
      1987 52 0.12064965
      1988 52 0.12064965
      1989 15 0.03480278

Each one provides different outputs and possesses unique capabilities. In general, tabyl should be your preferred basic function, thanks to its tidy integration and format output. It also enables some customization and data manipulation.


R has a collection of packages for handling time series data.

Some of these packages are part of the tidyverse family of packages. Particularly, it includes the tsibble package, which intends to create a data infrastructure for easier manipulation and handling of temporal data, and adapts the principles of tidy data).

According to the help description, a tsibble object is defined by

  • an index, as a variable with inherent ordering from past to present.

  • a key, as a set of variables that define observational units over time.

  • uniquely identified observations by an index and a key.

To convert the original dis1 dataset to a tsibble object, we use the as_tsibble function and specify the index, i.e. the variable specifying the time unit of interest. In our case, this is year and week variables. Variables that are not used for index or key purposes, such as the cases variables, are considered as measured variables.

Show the code
dis1z <-
    dis1 %>%
    mutate(date_index = make_yearweek(year = year, week = week)) %>%
    as_tsibble(index = date_index)

str(dis1z)
tbl_ts [431 × 4] (S3: tbl_ts/tbl_df/tbl/data.frame)
 $ year      : num [1:431] 1981 1981 1981 1981 1981 ...
 $ week      : num [1:431] 1 2 3 4 5 6 7 8 9 10 ...
 $ cases     : num [1:431] 1 3 NA 4 3 2 3 4 1 NA ...
 $ date_index: week [1:431] 1981 W01, 1981 W02, 1981 W03, 1981 W04, 1981 W05, 1981 W0...
   ..@ week_start: num 1
 - attr(*, "key")= tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
  ..$ .rows: list<int> [1:1] 
  .. ..$ : int [1:431] 1 2 3 4 5 6 7 8 9 10 ...
  .. ..@ ptype: int(0) 
 - attr(*, "index")= chr "date_index"
  ..- attr(*, "ordered")= logi TRUE
 - attr(*, "index2")= chr "date_index"
 - attr(*, "interval")= interval [1:1] 1W
  ..@ .regular: logi TRUE
Show the code
head(dis1z, 10)
# A tsibble: 10 x 4 [1W]
    year  week cases date_index
   <dbl> <dbl> <dbl>     <week>
 1  1981     1     1   1981 W01
 2  1981     2     3   1981 W02
 3  1981     3    NA   1981 W03
 4  1981     4     4   1981 W04
 5  1981     5     3   1981 W05
 6  1981     6     2   1981 W06
 7  1981     7     3   1981 W07
 8  1981     8     4   1981 W08
 9  1981     9     1   1981 W09
10  1981    10    NA   1981 W10

You can see how the number of dis1 cases is distributed in time using the ggplot package.

Show the code
ggplot(data = dis1z) +
    geom_line(mapping = aes(x = date_index, y = cases)) +
    scale_x_yearweek(date_labels = "%Y") +
    labs(x = "Date", y = "Number of Cases", title = "Disease 1 data") +
    theme_bw() + 
        theme(
            plot.title = element_text(face = "bold", 
                                      size = 12),
            legend.background = element_rect(fill = "white", 
                                             size = 4, 
                                             colour = "white"),
            # legend.justification = c(0, 1),
            legend.position = "bottom",
            panel.grid.minor = element_blank()
        )

Show the code
# We can save theme modifications into a single object and then use it in new plots
tsa_theme <- theme_bw() + 
        theme(
            plot.title = element_text(face = "bold", 
                                      size = 12),
            legend.background = element_rect(fill = "white", 
                                             size = 4, 
                                             colour = "white"),
            # legend.justification = c(0, 1),
            legend.position = "bottom",
            panel.grid.minor = element_blank()
        )

Note from the plot that there are missing values in the data.

If data points are collected at regular time interval, we can correct missing data by providing a default value or interpolating missing values from other data. The tsibble package has the fill_gaps function, which fills missing values with a pre-specified default value. It is quite common to replaces NAs with its previous observation for each time point in a time series analysis, which is easily done using the fill function from tidyr package.

A quick overview of implicit missing values with tsibble is available on vignette("implicit-na").

As the focus of this training is on understanding principles of time series analysis rather than visualisation of time series, we will mainly use the ggplot functions for the remaining exercises.


Let’s now work with the Pumala data

Show the code
dis3 <- import(here("data", "tsa_pumala.csv"))

Here you have one variable for the year, one variable for the month and one variable with the complete date in days, but in a text (chr class) format.

Show the code
str(dis3)
'data.frame':   3844 obs. of  3 variables:
 $ year    : int  1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
 $ month   : int  1 1 1 1 1 1 1 1 1 1 ...
 $ date_str: chr  "02/01/1995" "02/01/1995" "03/01/1995" "03/01/1995" ...
Show the code
head(dis3)
  year month   date_str
1 1995     1 02/01/1995
2 1995     1 02/01/1995
3 1995     1 03/01/1995
4 1995     1 03/01/1995
5 1995     1 04/01/1995
6 1995     1 04/01/1995
Show the code
summary(dis3)
      year          month          date_str        
 Min.   :1995   Min.   : 1.000   Length:3844       
 1st Qu.:1998   1st Qu.: 6.000   Class :character  
 Median :2000   Median : 9.000   Mode  :character  
 Mean   :2000   Mean   : 7.907                     
 3rd Qu.:2002   3rd Qu.:11.000                     
 Max.   :2004   Max.   :12.000                     

With the complete date, you can generate ISO weeks, but first you need the date variable to be converted from text to something R recognises as a date. The lubridate package has a number of convenient functions for converting text strings to dates.

Show the code
str(dis3$date_str)
 chr [1:3844] "02/01/1995" "02/01/1995" "03/01/1995" "03/01/1995" ...
Show the code
dis3 <- dis3 %>%
    mutate(my_date = dmy(date_str))

str(dis3$my_date)
 Date[1:3844], format: "1995-01-02" "1995-01-02" "1995-01-03" "1995-01-03" "1995-01-04" ...

As a complementary note, check the help for strftime to learn about different date formats in R (?strftime).

To convert to ISO week, we can use the ISOweek function from the ISOweek package, which creates a new variable representing the ISO week. We could then also create a new variable representing the first Monday of each ISO week. Although the popular lubridate package has an isoweek function (there is also a similar function in the surveillance package), we use the ISOweek package here as it has the ISOweek2date function. If epidemiological weeks are required, use the EpiWeek package.

Show the code
dis3 <- dis3 %>%
    mutate(
        date_isowk = ISOweek(my_date),
        isodate = ISOweek2date(paste(date_isowk, "-1", sep = ""))
    )

head(dis3)
  year month   date_str    my_date date_isowk    isodate
1 1995     1 02/01/1995 1995-01-02   1995-W01 1995-01-02
2 1995     1 02/01/1995 1995-01-02   1995-W01 1995-01-02
3 1995     1 03/01/1995 1995-01-03   1995-W01 1995-01-02
4 1995     1 03/01/1995 1995-01-03   1995-W01 1995-01-02
5 1995     1 04/01/1995 1995-01-04   1995-W01 1995-01-02
6 1995     1 04/01/1995 1995-01-04   1995-W01 1995-01-02

The paste command concatenates text.

Here we are adding "-01" onto the end of the ISO week variable (which is formatted something like "1995-W01"), to indicate that we want the first day of that week, and then supplying that to the ISOweek2date function, which converts that to a date.

Have a look at the new variables that have been created. date_isowk is the ISO week variable, in string format, and isodate is the Monday of each ISO week. Note that the years 1998 and 2004 each have an ISO week 53.

You have several observations in the same week since you have data from different days and one value corresponds to one case. Use the count function to aggregate the data.

Show the code
dis3_v2 <- dis3 %>%
    count(date_isowk)

head(dis3_v2)
  date_isowk  n
1   1995-W01 13
2   1995-W02 15
3   1995-W03  6
4   1995-W04  5
5   1995-W05 11
6   1995-W06 10
Show the code
dis3z <- dis3_v2 %>%
    mutate(date_index = yearweek(x = date_isowk, week_start = 1L)) %>%
    as_tsibble(index = date_index)

str(dis3z)
tbl_ts [491 × 3] (S3: tbl_ts/tbl_df/tbl/data.frame)
 $ date_isowk: chr [1:491] "1995-W01" "1995-W02" "1995-W03" "1995-W04" ...
 $ n         : int [1:491] 13 15 6 5 11 10 7 6 3 1 ...
 $ date_index: week [1:491] 1995 W01, 1995 W02, 1995 W03, 1995 W04, 1995 W05, 1995 W0...
   ..@ week_start: int 1
 - attr(*, "key")= tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
  ..$ .rows: list<int> [1:1] 
  .. ..$ : int [1:491] 1 2 3 4 5 6 7 8 9 10 ...
  .. ..@ ptype: int(0) 
 - attr(*, "index")= chr "date_index"
  ..- attr(*, "ordered")= logi TRUE
 - attr(*, "index2")= chr "date_index"
 - attr(*, "interval")= interval [1:1] 1W
  ..@ .regular: logi TRUE
Show the code
head(dis3z, 10)
# A tsibble: 10 x 3 [1W]
   date_isowk     n date_index
   <chr>      <int>     <week>
 1 1995-W01      13   1995 W01
 2 1995-W02      15   1995 W02
 3 1995-W03       6   1995 W03
 4 1995-W04       5   1995 W04
 5 1995-W05      11   1995 W05
 6 1995-W06      10   1995 W06
 7 1995-W07       7   1995 W07
 8 1995-W08       6   1995 W08
 9 1995-W09       3   1995 W09
10 1995-W10       1   1995 W10
Show the code
ggplot(data = dis3z) +
    geom_line(mapping = aes(x = date_index, y = n)) +
    scale_x_yearweek(date_labels = "%Y") +
    labs(x = "Date", y = "Number of Cases", title = "Disease 1 data") +
    tsa_theme

Aggregating cases by month is another possibility.

Show the code
dis3_agg <- dis3 %>%
    count(year, month)

dis3z_v2 <- dis3_agg %>%
    mutate(date_index = make_yearmonth(year = year, month = month)) %>%
    as_tsibble(index = date_index)

ggplot(data = dis3z_v2) +
    geom_line(mapping = aes(x = date_index, y = n)) +
    scale_x_yearweek(date_labels = "%Y") +
    labs(x = "Date", y = "Number of Cases", title = "Disease 1 data") +
    tsa_theme

Help for Optional Task 3.1.1

Import your data to R from the dis2 excel file.

Show the code
dis2 <- import(here("data", "tsa_practice.xlsx"), which = "dis2")

Inspect the data.

Show the code
str(dis2)
'data.frame':   44 obs. of  13 variables:
 $ year: num  1928 1929 1930 1931 1932 ...
 $ Jan : num  609 288 316 750 146 ...
 $ Feb : num  1516 241 781 2010 270 ...
 $ Mar : num  4952 347 1870 4858 621 ...
 $ Apr : num  7466 378 5309 6172 1096 ...
 $ May : num  11155 498 7499 7095 2271 ...
 $ Jun : num  7002 324 5386 4238 2537 ...
 $ Jul : num  1315 151 1386 907 1081 ...
 $ Aug : num  189 69 211 165 288 139 95 358 213 247 ...
 $ Sep : num  74 44 107 43 118 50 47 67 87 84 ...
 $ Oct : num  119 42 128 54 149 55 45 136 99 90 ...
 $ Nov : num  287 34 219 93 594 71 86 333 154 136 ...
 $ Dec : num  324 121 423 134 1183 ...
Show the code
head(dis2)
  year  Jan  Feb  Mar  Apr   May  Jun  Jul Aug Sep Oct Nov  Dec
1 1928  609 1516 4952 7466 11155 7002 1315 189  74 119 287  324
2 1929  288  241  347  378   498  324  151  69  44  42  34  121
3 1930  316  781 1870 5309  7499 5386 1386 211 107 128 219  423
4 1931  750 2010 4858 6172  7095 4238  907 165  43  54  93  134
5 1932  146  270  621 1096  2271 2537 1081 288 118 149 594 1183
6 1933 1960 4699 9635 9573  6605 2697  601 139  50  55  71  131
Show the code
summary(dis2)
      year           Jan              Feb               Mar         
 Min.   :1928   Min.   :  39.0   Min.   :   52.0   Min.   :   57.0  
 1st Qu.:1939   1st Qu.: 195.2   1st Qu.:  256.5   1st Qu.:  416.8  
 Median :1950   Median : 302.0   Median :  692.0   Median : 1190.0  
 Mean   :1950   Mean   : 903.8   Mean   : 1750.2   Mean   : 3357.4  
 3rd Qu.:1960   3rd Qu.:1478.2   3rd Qu.: 2083.2   3rd Qu.: 5188.8  
 Max.   :1971   Max.   :6336.0   Max.   :13226.0   Max.   :25826.0  
      Apr               May               Jun              Jul        
 Min.   :   78.0   Min.   :   83.0   Min.   :  79.0   Min.   :  35.0  
 1st Qu.:  622.2   1st Qu.:  700.8   1st Qu.: 724.2   1st Qu.: 340.0  
 Median : 1577.0   Median : 2110.0   Median :1809.5   Median : 598.5  
 Mean   : 3891.2   Mean   : 3379.6   Mean   :2246.2   Mean   : 713.5  
 3rd Qu.: 6605.2   3rd Qu.: 5809.0   3rd Qu.:3153.5   3rd Qu.:1027.8  
 Max.   :22741.0   Max.   :11155.0   Max.   :7002.0   Max.   :1975.0  
      Aug              Sep              Oct             Nov        
 Min.   : 28.00   Min.   : 18.00   Min.   : 11.0   Min.   :  12.0  
 1st Qu.: 98.75   1st Qu.: 43.75   1st Qu.: 53.5   1st Qu.:  56.5  
 Median :187.00   Median : 70.50   Median : 82.5   Median : 112.5  
 Mean   :189.75   Mean   : 79.73   Mean   :100.5   Mean   : 193.0  
 3rd Qu.:245.50   3rd Qu.:107.50   3rd Qu.:124.2   3rd Qu.: 236.0  
 Max.   :453.00   Max.   :184.00   Max.   :354.0   Max.   :1050.0  
      Dec        
 Min.   :  21.0  
 1st Qu.: 101.8  
 Median : 160.0  
 Mean   : 447.9  
 3rd Qu.: 572.5  
 Max.   :2996.0  
Show the code
View(dis2)

You have separate columns containing the counts. To plot this data, you need to first reshape your dataset by converting it from the current wide format to a long format.

The function pivot_longer can perform such transformation.

Show the code
dis2l <- dis2 %>%
  pivot_longer(
      cols = -year, 
      names_to = "month", 
      values_to = "case"
  ) %>%
  mutate(month = as_factor(month)) %>%  # as_factor sets levels in the order they appear
  arrange(year, month)

str(dis2l)
tibble [528 × 3] (S3: tbl_df/tbl/data.frame)
 $ year : num [1:528] 1928 1928 1928 1928 1928 ...
 $ month: Factor w/ 12 levels "Jan","Feb","Mar",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ case : num [1:528] 609 1516 4952 7466 11155 ...
Show the code
head(dis2l)
# A tibble: 6 × 3
   year month  case
  <dbl> <fct> <dbl>
1  1928 Jan     609
2  1928 Feb    1516
3  1928 Mar    4952
4  1928 Apr    7466
5  1928 May   11155
6  1928 Jun    7002
Show the code
dis2l_agg <- dis2l %>%
    mutate(date_index = make_yearmonth(year = year, month = month)) %>%
    as_tsibble(index = date_index)

head(dis2l_agg)
# A tsibble: 6 x 4 [1M]
   year month  case date_index
  <dbl> <fct> <dbl>      <mth>
1  1928 Jan     609  1928 ene.
2  1928 Feb    1516  1928 feb.
3  1928 Mar    4952  1928 mar.
4  1928 Apr    7466  1928 abr.
5  1928 May   11155  1928 may.
6  1928 Jun    7002  1928 jun.
Show the code
ggplot(data = dis2l_agg) +
    geom_line(mapping = aes(x = date_index, y = case)) +
    scale_x_yearweek(date_labels = "%Y") +
    labs(x = "Date", y = "Number of Cases", title = "Disease 3 data") +
    tsa_theme

dis1 corresponds to salmonellosis cases, dis2 to measles cases in New York.